All single-cell RNA-seq protocols and technologies require library preparation prior to sequencing on a platform such as Illumina. Here, we present the first report to utilize the BGISEQ-500 platform for scRNA-seq, and compare the sensitivity and accuracy to Illumina sequencing. We generate a scRNA-seq resource of 468 unique single-cells and 1,297 matched single cDNA samples, performing SMARTer and Smart-seq2 protocols on mESCs and K562 cells with RNA spike-ins. We sequence these libraries on both BGISEQ-500 and Illumina HiSeq platforms using single- and paired-end reads. The two platforms have comparable sensitivity and accuracy in terms of quantification of gene expression, and low technical variability. Our study provides a standardised scRNA-seq resource to benchmark new scRNA-seq library preparation protocols and sequencing platforms.
SCQUA is a python package that we developed for analyzing the single cell RNA Sequencing quality.
%pylab inline
from SCQUA import *
ls ../
Figs include the resulted figures
data include all the raw data
jup include the jupyter notebook to process the data and make the figures
pdata include all the processed data.
ls ../data/
Salmon outputs 3 information, the counts, tpm and qc metrics. We concatenate these information in *cts.gz, *tpm.gz and *qc.gz.
detection limit (sensitivity) and accuracy based on spike-ins can be calculated.
ercc = get_ERCC()
sirv = get_SIRV()
spike = pd.concat([ercc,sirv])
Get results for all the datasets.
for f in iglob('../data/*_tpm.csv.gz'):
name = f.split('/')[-1].replace('_tpm.csv.gz',"")
print(name)
cts = pd.read_csv(f.replace('tpm','cts'), index_col=0)
tpm = pd.read_csv(f, index_col=0)
phn = pd.read_csv(f.replace('tpm','qc'), index_col=0)
df = get_result(tpm, ercc, sirv, spike)
phn = pd.concat([phn,df,cts.loc[cts.index.str.startswith("ENS")].T], axis=1)
phn["Total_counts"] = cts.loc[cts.index.str.startswith("ENS")].sum()
phn["study"] = name
phn.to_csv("../pdata/%s.csv"%name)
This file is annotation of the all the cells sequenced.
phn = pd.read_csv("../data/phn.csv",index_col=0)
phn.head()
phn.tail()
map the annotation
for f in iglob("../pdata/*.csv"):
print(f)
fout = f.replace("pdata","pdata1")
df = pd.read_csv(f,index_col=0)
df = df.loc[df.index.isin(phn.index)]
dfp = phn.loc[df.index]
df = pd.concat([df, dfp], axis=1)
df.to_csv(fout)
(B) Single-cell detection limit (Sensitivity) of mESC cells, downsampled across two orders of magnitude from SMARTer and two Smart-seq2 replicates (633 samples). The single-cell sensitivities are largely similar between different library preparation across scRNA-seq protocols. (C) Single-cell accuracy of mESC cells, downsampled across two orders of magnitude for SMARTer and two Smart-seq2 replicates (633 samples). The grey dotted lines indicate downsampling at different read depths per cell, while red line indicates saturation per cell.
dfs = []
for f in iglob("../pdata1/*_ds_*"):
df = pd.read_csv(f, index_col=0)
df["id"] = df.index
df.index = df.study+"_"+df.id
df = df[["id",'Cell','Fullname','num_processed',\
'detection_limit','accuracy','Protocol','Batch','Protocol2']+
df.columns[df.columns.str.startswith('ENS')].tolist()]
dfs.append(df)
df = pd.concat(dfs,axis =0)
df.shape
df.head()
df=df[df.Protocol.str.startswith("S")]
use only the protocols in sanger
df = df.loc[~(df.Protocol2.str.startswith('BGISEQ-500 K562')|\
df.Protocol2.str.startswith('Hiseq4000')|\
df.Protocol2.str.startswith('BGISEQ-500 ESC'))]
df.Protocol.value_counts()
df.Protocol2.value_counts()
df=df.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how='any')
df.to_csv("../pdata1/Fig1BC.csv")
df = pd.read_csv("../pdata1/Fig1BC.csv", index_col=0)
figsize(5,5)
matplotlib.rcParams.update({'font.size': 14})
names = ['BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1',
'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2',
'BGISEQ-500 SMARTer','HiSeq-2000 SMARTer']
ax = fit_sensitivity(df,key2="num_processed",key3='Protocol2', \
yscale='log', xlabel = 'Sequenced reads per single-well',\
ylabel = 'Detection limit \n (# of molecules)', \
ylim = [0, 1e4], xlim = [500, 1e8], \
xlabelsize = 18, ylabelsize = 18, \
title = 'Sensitivity', titlesize = 20, names = names)
figsize(5,5)
matplotlib.rcParams.update({'font.size': 14})
names = ['BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1',
'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2',
'BGISEQ-500 SMARTer','HiSeq-2000 SMARTer']
ax = fit_sensitivity(df,key2="num_processed",key3='Protocol2', \
yscale='log', xlabel = 'Sequenced reads per single-well',\
ylabel = 'Detection limit \n (# of molecules)', \
ylim = [0, 1e4], xlim = [500, 1e8], \
xlabelsize = 18, ylabelsize = 18, \
title = 'Sensitivity', titlesize = 20, names = names, colordots = True)
ax = fit_sensitivity(df, fun = 'accuracy',key1="accuracy",key2="num_processed",key3='Protocol2', \
xlabel = 'Sequenced reads per single-well',\
ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
ylim = [0, 1], xlim = [500, 1e8], \
xlabelsize = 18, ylabelsize = 18, \
title = 'Accuracy', titlesize = 20, names = names)
ax = fit_sensitivity(df, fun = 'accuracy',key1="accuracy",key2="num_processed",key3='Protocol2', \
xlabel = 'Sequenced reads per single-well',\
ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
ylim = [0, 1], xlim = [500, 1e8], \
xlabelsize = 18, ylabelsize = 18, \
title = 'Accuracy', titlesize = 20, names = names, colordots=True)
(D) PCA for matched single-cell cDNA samples performed using SMARTer and two replicates of Smart-seq2 and sequenced across both sequencing platforms. Red and green colored circles indicate sequencing of matched cDNA on Illumina HiSeq2500 and BGISEQ-500 respectively. The dotted lines represent distance i.e. measure of similarity across sequencing platforms. (E) Single-cell correlations for each scRNA-seq protocol and across sequencing platforms. The correlations (R=0.52~0.70) are comparable between sequencing platforms.
df = pd.read_csv("../pdata1/Fig1BC.csv", index_col=0)
figsize(5,5)
dfs1 = []
dfs2 = []
for p, l in zip(['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer'],[230, 120, 210]):
df1, df2 = plot_Fig1E(df, protocol = p,\
lim = l,\
xlabel = "BGISEQ-500", ylabel="HiSeq-2000",\
xlabelsize = 18, ylabelsize=18, titlesize = 18, legendsize = 16)
plt.show()
dfs1.append(df1)
dfs2.append(df2)
def plot_Fig1E(df, \
protocol = "SMARTer",\
key1 = 'Batch',\
key2 = 'Protocol2',\
key3 = 'Cell',\
key4 = 'Fullname',\
lim = 210,\
cutoff = 1000, \
xlabel = None, ylabel=None,\
xlabelsize = None, ylabelsize=None, titlesize = None, legendsize = None):
dfx = df[df[key1] == protocol]
dfx = dfx[dfx.index.str.find('e6')>0]
df1 = dfx[dfx[key2].str.startswith('BGISEQ-500')].sort_values(by=key3)
df2 = dfx[dfx[key2].str.startswith('HiSeq-2000')].sort_values(by=key3)
df1 = df1[df1[key4].isin(df2[key4])]
df2 = df2[df2[key4].isin(df1[key4])]
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
dfa = pd.concat([df1,df2], axis = 0)
exprs = np.log1p(dfa.T.loc[dfa.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))
dfa["PC1"] = pca_res[:,0]
dfa["PC2"] = pca_res[:,1]
from scipy.spatial.distance import cdist
dfb = dfa[['PC1','PC2']]
dfb1 = dfb.iloc[:int((dfb.shape[0])/2)]
dfb2 = dfb.iloc[int((dfb.shape[0])/2):]
cdst = cdist(dfb1.values,dfb2.values)
# print(np.diagonal(cdst))
xx1 = df1.detection_limit[(df2.detection_limit <cutoff).tolist()]
xx2 = df2.detection_limit[(df2.detection_limit <cutoff).tolist()]
xx3 = np.diagonal(cdst)[(df2.detection_limit <cutoff).tolist()]
slope, intercept, r_value, p_value, std_err = \
scipy.stats.linregress(xx1, xx2)
from scipy.stats import spearmanr
r_value, _ = spearmanr(xx1,xx2)
plt.plot(xx1, xx2, 'k.', label = "R=%.2f"%r_value)
plt.plot(xx1[xx3>50], xx2[xx3>50], 'r.', label='')
print("outlier cells:")
print(df2.index[np.diagonal(cdst)>50])
plt.xlim(0,lim)
plt.ylim(0,lim)
if xlabel is None: xlabel = key2
if ylabel is None: ylabel = key1
if xlabelsize is None:
plt.xlabel(xlabel)
else:
plt.xlabel(xlabel, fontsize=xlabelsize)
if ylabelsize is None:
plt.ylabel(ylabel)
else:
plt.ylabel(ylabel, fontsize=ylabelsize)
if not titlesize is None:
plt.title(protocol, fontsize = titlesize)
else:
plt.title(protocol)
if legendsize is None:
plt.legend(loc='upper right')
else:
plt.legend(loc='upper right', fontsize=legendsize)
return(df1, df2)
outlier cells are listed here:
figsize(5,5)
dfs1 = []
dfs2 = []
for p, l in zip(['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer'],[230, 120, 210]):
df1, df2 = plot_Fig1E(df, protocol = p,\
lim = l,\
xlabel = "BGISEQ-500", ylabel="HiSeq-2000",\
xlabelsize = 18, ylabelsize=18, titlesize = 18, legendsize = 16)
plt.show()
dfs1.append(df1)
dfs2.append(df2)
dfs = []
for f in ['../pdata1/bgi.csv','../pdata1/sanger.csv']:
df = pd.read_csv(f, index_col=0)
df["id"] = df.index
df.index = df.study+"_"+df.id
df = df[["id",'Cell','Fullname','num_processed',\
'detection_limit','accuracy','Protocol','Batch','Protocol2']+
df.columns[df.columns.str.startswith('ENS')].tolist()]
dfs.append(df)
df = pd.concat(dfs,axis =0)
df=df[df.Protocol.str.startswith("S")]
df = df.loc[~(df.Protocol2.str.startswith('BGISEQ-500 K562')|\
df.Protocol2.str.startswith('Hiseq4000')|\
df.Protocol2.str.startswith('BGISEQ-500 ESC'))]
df.Protocol.value_counts()
df.Protocol2.value_counts()
df=df.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how='any')
df.to_csv("../pdata1/FigS2C.csv")
df.shape
def plot_Fig1E(df, \
protocol = "SMARTer",\
key1 = 'Batch',\
key2 = 'Protocol2',\
key3 = 'Cell',\
key4 = 'Fullname',\
lim = 210,\
cutoff = 1000, \
xlabel = None, ylabel=None,\
xlabelsize = None, ylabelsize=None, titlesize = None, legendsize = None):
dfx = df[df[key1] == protocol]
# dfx = dfx[dfx.index.str.find('e6')>0]
df1 = dfx[dfx[key2].str.startswith('BGISEQ-500')].sort_values(by=key3)
df2 = dfx[dfx[key2].str.startswith('HiSeq-2000')].sort_values(by=key3)
df1 = df1[df1[key4].isin(df2[key4])]
df2 = df2[df2[key4].isin(df1[key4])]
xx1 = df1.detection_limit[(df2.detection_limit <cutoff).tolist()]
xx2 = df2.detection_limit[(df2.detection_limit <cutoff).tolist()]
slope, intercept, r_value, p_value, std_err = \
scipy.stats.linregress(xx1, xx2)
from scipy.stats import spearmanr
r_value, _ = spearmanr(xx1,xx2)
plt.plot(xx1, xx2, 'k.', label = "R=%.2f"%r_value)
plt.xlim(0,lim)
plt.ylim(0,lim)
if xlabel is None: xlabel = key2
if ylabel is None: ylabel = key1
if xlabelsize is None:
plt.xlabel(xlabel)
else:
plt.xlabel(xlabel, fontsize=xlabelsize)
if ylabelsize is None:
plt.ylabel(ylabel)
else:
plt.ylabel(ylabel, fontsize=ylabelsize)
if not titlesize is None:
plt.title(protocol, fontsize = titlesize)
else:
plt.title(protocol)
if legendsize is None:
plt.legend(loc='upper right')
else:
plt.legend(loc='upper right', fontsize=legendsize)
return(df1, df2)
figsize(5,5)
dfs1 = []
dfs2 = []
for p, l in zip(['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer'],[230, 120, 210]):
df1, df2 = plot_Fig1E(df, protocol = p,\
lim = l,\
xlabel = "BGISEQ-500", ylabel="HiSeq-2000",\
xlabelsize = 18, ylabelsize=18, titlesize = 18, legendsize = 16)
plt.show()
dfs1.append(df1)
dfs2.append(df2)
def plot_Fig1D(df1, df2, \
protocol = "SMARTer", \
key2 = 'Protocol2', \
key3 = 'BGISEQ-500', \
key4 = 'Fullname',\
xlabelsize = None, ylabelsize = None, titlesize = None):
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
df = pd.concat([df1,df2], axis = 0)
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))
df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
g = df.groupby('Protocol2')
colors = ["#00cc99","#e0115f"]
for i, group in enumerate(g.indices):
tmp = g.indices[group]
plt.scatter(df.PC1.iloc[tmp], df.PC2.iloc[tmp], label=group.replace(protocol,""), lw = 0, c=colors[i])
from scipy.spatial.distance import cdist
for i in df[df[key2].str.startswith(key3)][key4]:
j = df[(df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
k = df[(~df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
# d1 = abs(j["PC1"].values - k["PC1"].values)
# d2 = abs(j["PC2"].values - k["PC2"].values)
# d = np.sqrt(d1*d1+d2*d2)
cdst = cdist(j[['PC1','PC2']].values,k[['PC1','PC2']].values)
d = cdst[0][0]
if d < 60:
plt.plot([j["PC1"],k["PC1"]],[j["PC2"],k["PC2"]],'k-.')
else:
plt.plot([j["PC1"],k["PC1"]],[j["PC2"],k["PC2"]],'r-.')
plt.xlabel("PC1 (%.3f)"%pca.explained_variance_ratio_[0], fontsize = xlabelsize)
plt.ylabel("PC2 (%.3f)"%pca.explained_variance_ratio_[1], fontsize = ylabelsize)
if not titlesize is None:
plt.title(protocol, fontsize = titlesize)
else:
plt.title(protocol)
plt.legend(scatterpoints=3, bbox_to_anchor=(1.03, 1), borderaxespad=0.);
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
plt.clf()
plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18)
plt.show()
i+=1
def plot_Fig1D(df1, df2, \
protocol = "SMARTer", \
key2 = 'Protocol2', \
key3 = 'BGISEQ-500', \
key4 = 'Fullname',\
xlabelsize = None, ylabelsize = None, titlesize = None, tp = 1):
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
df = pd.concat([df1,df2], axis = 0)
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))
df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
if tp ==1:
g = df.groupby('Protocol2')
colors = ["#00cc99","#e0115f"]
for i, group in enumerate(g.indices):
tmp = g.indices[group]
plt.scatter(df.PC1.iloc[tmp], df.PC2.iloc[tmp], label=group.replace(protocol,""), lw = 0, c=colors[i])
for i in df[df[key2].str.startswith(key3)][key4]:
j = df[(df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
k = df[(~df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
plt.plot([j["PC1"],k["PC1"]],[j["PC2"],k["PC2"]],'k-.')
plt.legend(scatterpoints=3, bbox_to_anchor=(1.03, 1), borderaxespad=0.);
else:
if tp ==2:
df1 = df[df.columns[df.columns.str.startswith('ENS')]]
df['n_genes'] = np.sum(df1>0,axis=1)
plt.scatter(df.PC1, df.PC2, lw = 0, c=df['n_genes'])
plt.colorbar()
elif tp ==3:
df['Nanog'] = df['ENSMUSG00000012396.12']
plt.scatter(df.PC1, df.PC2, lw = 0, c=df['Nanog'])
plt.colorbar()
elif tp ==4:
df['Pou5f1'] = df['ENSMUSG00000024406.16']
plt.scatter(df.PC1, df.PC2, lw = 0, c=df['Pou5f1'])
plt.colorbar()
elif tp ==5:
df['Pax6'] = df['ENSMUSG00000027168.21']
plt.scatter(df.PC1, df.PC2, lw = 0, c=df['Pax6'])
plt.colorbar()
plt.xlabel("PC1 (%.3f)"%pca.explained_variance_ratio_[0], fontsize = xlabelsize)
plt.ylabel("PC2 (%.3f)"%pca.explained_variance_ratio_[1], fontsize = ylabelsize)
if not titlesize is None:
plt.title(protocol, fontsize = titlesize)
else:
plt.title(protocol)
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
plt.clf()
plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=2)
plt.show()
i+=1
df.shape
ls ../data/GRCm38.cdna.all.symbol.tsv
df1 = pd.read_csv('../data/GRCm38.cdna.all.symbol.tsv', sep=' ',header=None, index_col=1)
df1.loc['Nanog']
df1.loc['Pou5f1']
df1.loc['Pax6']
df = pd.concat([dfs1[0],dfs2[0]])
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))
df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
df[['PC1','PC2']].to_csv("PCA_cell_SM2rep1.csv")
df = pd.concat([dfs1[1],dfs2[1]])
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))
df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
df[['PC1','PC2']].to_csv("PCA_cell_SM2rep2.csv")
df = pd.concat([dfs1[2],dfs2[2]])
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))
df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
df[['PC1','PC2']].to_csv("PCA_cell_SMARTer.csv")
Nanog
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
plt.clf()
plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=3)
plt.show()
i+=1
Pou5f1
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
plt.clf()
plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=4)
plt.show()
i+=1
Pax6
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
plt.clf()
plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=5)
plt.show()
i+=1
dfs = []
for f in iglob("../pdata1/*_ds_*"):
df = pd.read_csv(f, index_col=0)
df["id"] = df.index
df.index = df.study+"_"+df.id
df = df[["id",'Cell','Fullname','num_processed',\
'detection_limit_ERCC','accuracy_ERCC','Protocol','Batch','Protocol2']+
df.columns[df.columns.str.startswith('ENS')].tolist()]
dfs.append(df)
df = pd.concat(dfs,axis =0)
df.shape
df.head()
df.Protocol.value_counts()
df.Protocol2.value_counts()
df['Protocol2'] = df['Protocol2'].str.replace("Seuqencing",'Sequencing')
df = df[df.Protocol2.str.find("Sequencing")>0]
df.Protocol2.value_counts()
df.to_csv("../pdata1/Fig2B.csv")
df = pd.read_csv("../pdata1/Fig2B.csv", index_col=0)
df=df.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how='any')
figsize(5,5)
matplotlib.rcParams.update({'font.size': 14})
names = ['BGISEQ-500 ESC Sequencing rep1',
'Hiseq4000 ESC Sequencing rep1',
'BGISEQ-500 K562 Sequencing rep1','Hiseq4000 K562 Sequencing rep1']
colors = [matplotlib.colors.rgb2hex(plt.get_cmap("Paired")(i)) for i in range(6,10)]
ax = fit_sensitivity(df,fun='np.log10(detection_limit_ERCC)', key1 ='detection_limit_ERCC',key2="num_processed",key3='Protocol2', \
yscale='log', xlabel = 'Sequenced reads per single-well',\
ylabel = 'Detection limit \n (# of molecules)', \
ylim = [0, 1e5], xlim = [500, 1e8], \
xlabelsize = 18, ylabelsize = 18, \
title = 'Sensitivity', titlesize = 20, names = names, colors = colors)
ax = fit_sensitivity(df, fun = 'accuracy_ERCC',key1="accuracy_ERCC",key2="num_processed",key3='Protocol2', \
xlabel = 'Sequenced reads per single-well',\
ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
ylim = [0, 1], xlim = [500, 1e8], \
xlabelsize = 18, ylabelsize = 18, \
title = 'Accuracy', titlesize = 20, names = names, colors = colors)
ax = fit_sensitivity(df, fun = 'accuracy_ERCC',key1="accuracy_ERCC",key2="num_processed",key3='Protocol2', \
xlabel = 'Sequenced reads per single-well',\
ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
ylim = [0, 1], xlim = [500, 1e8], \
xlabelsize = 18, ylabelsize = 18, \
title = 'Accuracy', titlesize = 20, names = names, colors = colors, colordots=True)
dfs = []
for f in ['bgi','sanger']:
df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
dfs.append(df)
df = pd.concat(dfs, axis=0)
df.shape
df['Protocol2'].value_counts()
df1 = df[df.columns[df.columns.str.startswith('ENS')|df.columns.str.startswith('Protocol2')]]
df1 = df1.groupby('Protocol2').sum()
df1.to_csv("../pdata1/pseudobulk.csv")
df1 = pd.read_csv("../pdata1/pseudobulk.csv", index_col=0)
df1.head(2)
df1.index
l1 = df1.index[df1.index.str.startswith('BGI')].tolist()
l2 = df1.index[df1.index.str.startswith('Hi')]
l1 = ['BGISEQ-500 ESC Sequencing rep1',
'BGISEQ-500 K562 Seuqencing rep1',
'BGISEQ-500 SMARTer',
'BGISEQ-500 Smart-Seq2 rep1',
'BGISEQ-500 Smart-Seq2 rep2']
l2 = ['Hiseq4000 ESC Sequencing rep1',
'Hiseq4000 K562 Sequencing rep1',
'HiSeq-2000 SMARTer',
'HiSeq-2000 Smart-Seq2 rep1',
'HiSeq-2000 Smart-Seq2 rep2']
names = []
corrs = []
for i,j in zip(l1,l2):
cor = df1.loc[i].corr(df1.loc[j], method='pearson')
corrs.append(cor)
names.append(i.replace('BGISEQ-500 ',''))
df = pd.DataFrame({'name':names,'corr':corrs})
This is the pseudobulk table.
the calculation of detection_limit and accuracies need some changes in the analysis workflow.
df
df = pd.DataFrame(numpy.corrcoef(df1), index = df1.index, columns = df1.index)
This is an all-to-all correlation table.
df
df.index
cols =['BGISEQ-500 SMARTer', 'HiSeq-2000 SMARTer','BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1',
'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2','BGISEQ-500 ESC Sequencing rep1',
'Hiseq4000 ESC Sequencing rep1', 'BGISEQ-500 K562 Seuqencing rep1',
'Hiseq4000 K562 Sequencing rep1']
cols =['BGISEQ-500 SMARTer', 'HiSeq-2000 SMARTer','BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1',
'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2']
df = df[cols]
df = df.loc[cols]
figsize(6,6)
fig, ax = plt.subplots()
im = ax.imshow(df.values)
# We want to show all ticks...
ax.set_xticks(np.arange(df.shape[0]))
ax.set_yticks(np.arange(df.shape[0]))
# ... and label them with the respective list entries
ax.set_xticklabels(df.index)
ax.set_yticklabels(df.index)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(df.shape[0]):
for j in range(df.shape[0]):
text = ax.text(j, i, '%.2f'%df.iloc[i, j],
ha="center", va="center", color="w")
# ax.set_title("Harvest of local farmers (in tons/year)")
fig.tight_layout()
# plt.show()
plt.savefig("Heatmap.pdf")
figsize(6,6)
fig, ax = plt.subplots()
im = ax.imshow(df.values)
# We want to show all ticks...
ax.set_xticks(np.arange(df.shape[0]))
ax.set_yticks(np.arange(df.shape[0]))
# ... and label them with the respective list entries
ax.set_xticklabels(df.index)
ax.set_yticklabels(df.index)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(df.shape[0]):
for j in range(df.shape[0]):
text = ax.text(j, i, '%.2f'%df.iloc[i, j],
ha="center", va="center", color="w")
# ax.set_title("Harvest of local farmers (in tons/year)")
fig.tight_layout()
plt.show()
from matplotlib_venn import venn3
df = pd.read_csv("../pdata1/pseudobulk.csv", index_col=0)
df
df.shape
from matplotlib_venn import venn2
i=0
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
i=1
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
83A overlap with 84A. 83B overlap with 84B. overlap in 83 and overlap in 84.
# print(df.index[i])
name1 = 'BGISEQ-500 K562 Seuqencing rep1'
name2 = 'BGISEQ-500 ESC Sequencing rep1'
venn2([set(df.loc[name1][df.loc[name1]>10].index), set(df.loc[name2][df.loc[name2]>10].index)],set_labels=(name1,name2))
# print(df.index[i])
name1 = 'Hiseq4000 K562 Sequencing rep1'
name2 = 'Hiseq4000 ESC Sequencing rep1'
venn2([set(df.loc[name1][df.loc[name1]>10].index), set(df.loc[name2][df.loc[name2]>10].index)],set_labels=(name1,name2))
name1 = 'Hiseq4000 K562 Sequencing rep1'
name2 = 'Hiseq4000 ESC Sequencing rep1'
x1 = df.loc[name1][df.loc[name1]>10].index
x2 = df.loc[name2][df.loc[name2]>10].index
name1 = 'BGISEQ-500 K562 Seuqencing rep1'
name2 = 'BGISEQ-500 ESC Sequencing rep1'
x3 = df.loc[name1][df.loc[name1]>10].index
x4 = df.loc[name2][df.loc[name2]>10].index
overlap = x3[x3.isin(x4)]
pd.Series(overlap).to_csv("overlap.csv")
non1 = x3[~x3.isin(x4)]
pd.Series(non1).to_csv("k562only_genes.csv")
non2 = x4[~x4.isin(x3)]
pd.Series(non2).to_csv("ESConly_genes.csv")
x5 = x3[x3.isin(x1)]
x6 = x4[x4.isin(x2)]
venn2([set(x5), set(x6)],set_labels=('K562','ESC'))
i=2
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
i=3
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
i=4
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
dfs = []
for f in ['bgi','sanger']:
df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
dfs.append(df)
df = pd.concat(dfs, axis=0)
df1 = df[df.columns[df.columns.str.startswith('ENS')|df.columns.str.startswith('Protocol2')]]
df = df1.T
names = df.loc['Protocol2'].unique()
for s in df.loc['Protocol2'].unique():
# print(s)
df['%s_mean'%s] = np.mean(df.loc[:,df.loc['Protocol2'] == s].iloc[:-1,:],1)
df['%s_var'%s] = np.var(df.loc[:,df.loc['Protocol2'] == s].iloc[:-1,:],1)
df['%s_drop'%s] = np.sum((df.loc[:,df.loc['Protocol2'] == s].iloc[:-1,:]==0),1) \
/ df.loc[:,df.loc['Protocol2'] == s].shape[1]
df['%s_cv2'%s] = df['%s_var'%s]/(df['%s_mean'%s]**2)
df = df.iloc[:-1,:]
figsize(5,5)
plt.loglog()
for s in names:
plt.scatter(df['%s_mean'%s], df['%s_var'%s], s = 0.1, label = s)
plt.xlim(1e-6, 1e5)
plt.ylim(1e-9, 1e10);
plt.legend(bbox_to_anchor=(1.05, 1))
plt.xlabel("Means")
plt.ylabel("variances")
names
namet = [['BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2'],
['BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1'],
['BGISEQ-500 SMARTer','HiSeq-2000 SMARTer'],
['BGISEQ-500 K562 Seuqencing rep1','Hiseq4000 K562 Sequencing rep1',],
['BGISEQ-500 ESC Sequencing rep1','Hiseq4000 ESC Sequencing rep1']]
titles = ['Smart-Seq2 rep2',
'Smart-Seq2 rep1',
'SMARTer',
'K562',
'ESC']
figsize(5,5)
for tt, nameu in zip(titles,namet):
plt.clf()
plt.loglog()
for s in nameu:
plt.scatter(df['%s_mean'%s], df['%s_cv2'%s], s = 0.1, label = s)
plt.xlim(1e-6, 1e5)
plt.ylim(1e-2, 1e2);
plt.legend(bbox_to_anchor=(1.05, 1))
plt.title(tt)
plt.xlabel("Means")
plt.ylabel("$CV^2$")
plt.show()
df.head()
for tt, nameu in zip(titles,namet):
plt.clf()
plt.xscale('log')
for s in nameu:
plt.scatter(df['%s_mean'%s], df['%s_drop'%s], s = 0.1, label = s)
plt.xlim(1e-5, 1e5)
plt.ylim(-0.1, 1.1);
plt.legend(bbox_to_anchor=(1.05, 1))
plt.xlabel("Means")
plt.ylabel("Dropout rate")
plt.show()
dfs = []
for f in ['bgi','sanger']:
df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
dfs.append(df)
df = pd.concat(dfs, axis=0)
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
dm = pd.read_csv("../data/GRCm38.cdna.all.symbol.tsv",index_col=1, sep=' ',header=None)
dn = pd.read_csv("../data/PlurinetGenes_Category.txt", sep = '\t', index_col=0)
dm = dm.loc[dn.index]
dm['funct_cat'] = dn['funct_cat']
dm = dm.dropna()
df = df[df.columns[df.columns.isin(dm[0])|df.columns.isin(['Protocol2','n_genes'])].tolist()]
dm['gene'] = dm.index
dm.index = dm[0]
dm.head()
df.head()
df.columns = ['n_genes']+dm.loc[df.columns[1:-1]]['gene'].tolist()+['Protocol2']
df.to_csv("../pdata1/Heatmap.csv")
dm.funct_cat.value_counts()
df = pd.read_csv("../pdata1/Heatmap.csv", index_col=0)
dm[dm.funct_cat == 'Differentiation']['gene'].tolist() + ['Protocol2','n_genes']
df.Protocol2.value_counts()
df = df[dm[dm.funct_cat == 'Differentiation']['gene'].tolist() + ['n_genes','Protocol2']]
left side is BGI
def make_heatmap(df, key = 'HiSeq-2000 SMARTer', cut=5000,title=''):
x = df[df.Protocol2.str.endswith(key)]
z = x.iloc[:,:-1]
z = z.sort_values(by='n_genes')
y = z.iloc[:,:-1]
plt.clf()
plt.pcolormesh(np.log1p(y.T),cmap='RdYlBu_r')
plt.ylabel('Genes', fontsize= 24)
plt.xlabel('Cells', fontsize= 24)
plt.axvline((x['n_genes']<cut).sum(), lw=2, c='r',linestyle='-')
plt.gca().set_yticks(np.arange(y.shape[1])+0.5, minor=True)
plt.gca().set_yticks([], minor=False)
plt.gca().set_yticklabels(x.columns[:-2], minor=True)
plt.gca().set_xticks([], minor=False)
plt.title(title)
plt.show()
red line is the cutoff of number of genes. 7500 for BGI and 5000 for Uk.
x = df[df.Protocol2.str.endswith('BGISEQ-500 ESC Sequencing rep1')]
z = x.sort_values(by='n_genes')
figsize(5,3)
for k in df.Protocol2.unique():
cut = 5000
if k.find('BGI')>-1:cut=7500
make_heatmap(df, key = k, cut=cut,title=k)
df = pd.read_csv("../pdata1/Heatmap.csv", index_col=0)
dm.funct_cat.value_counts()
df.Protocol2.value_counts()
df = df[dm[dm.funct_cat == 'Pluripotency']['gene'].tolist() + ['n_genes','Protocol2']]
figsize(5,5)
for k in df.Protocol2.unique():
cut = 5000
if k.find('BGI')>0:cut=7500
make_heatmap(df, key = k, cut=cut,title=k)
figsize(5,5)
make_heatmap(df, key = 'Smart-Seq2 rep1')
figsize(5,5)
make_heatmap(df, key = 'Smart-Seq2 rep2')
figsize(5,5)
make_heatmap(df, key = 'SMARTer')
figsize(5,5)
make_heatmap(df, key = 'K562 Sequencing rep1')
figsize(5,5)
make_heatmap(df, key = 'ESC Sequencing rep1')
dfs = []
for f in ['bgi','sanger']:
df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
dfs.append(df)
df = pd.concat(dfs, axis=0)
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
df1 = pd.read_csv("../pdata1/pseudobulk.csv", index_col=0)
key = 'Smart-Seq2 rep1'
df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
key = 'Smart-Seq2 rep2'
df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
key = 'SMARTer'
df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
df1.index = df1.index.str.replace('Seuqencing','Sequencing')
key = 'K562 Sequencing rep1'
df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
key = 'ESC Sequencing rep1'
df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
df1 = pd.read_csv("../data/fLen_bgi.csv", index_col=0)
df2 = pd.read_csv("../data/fLen_sanger.csv", index_col=0)
df = pd.concat([df1,df2])
df.index = df.index.str.split('/').str[-1]
df.head()
df['Protocol2'] = phn.loc[df.index]['Protocol2']
df = df[~df.Protocol2.isna()]
df.shape
df.columns = list(arange(1001)+1)+['Protocol2']
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
df.Protocol2.value_counts()
colors = [matplotlib.colors.rgb2hex(plt.get_cmap("Paired")(i)) for i in range(0,10)]
cc = dict(zip(df.Protocol2.unique(),colors))
figsize(5,5)
for i in arange(df.shape[0]):
plt.scatter(arange(1001)+1, df.iloc[i][:-1], s = 0.1, label = df.iloc[i][-1], c = cc[df.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep1')]
for i in arange(df1.shape[0]):
plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep2')]
for i in arange(df1.shape[0]):
plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('SMARTer')]
for i in arange(df1.shape[0]):
plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('ESC Sequencing rep1')]
for i in arange(df1.shape[0]):
plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('K562 Sequencing rep1')]
for i in arange(df1.shape[0]):
plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
df1 = pd.read_csv("../data/coverage_bgi.csv", index_col=0)
df2 = pd.read_csv("../data/coverage_sanger.csv", index_col=0)
phn = pd.read_csv("../data/phn.csv",index_col=0)
df = pd.concat([df1,df2], axis=0)
df.index = df.index.str.split('/').str[-1]
df.head()
df['Protocol2'] = phn.loc[df.index]['Protocol2']
df = df[~df.Protocol2.isna()]
df.shape
df.columns = list(np.linspace(0, 100, 20))+['Protocol2']
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
df.Protocol2.value_counts()
colors = [matplotlib.colors.rgb2hex(plt.get_cmap("Paired")(i)) for i in range(0,10)]
cc = dict(zip(df.Protocol2.unique(),colors))
figsize(5,5)
for i in arange(df.shape[0]):
plt.plot(np.linspace(0, 100, 20), df.iloc[i][:-1], label = df.iloc[i][-1], c = cc[df.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
# plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep1')]
for i in arange(df1.shape[0]):
plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep2')]
for i in arange(df1.shape[0]):
plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('SMARTer')]
for i in arange(df1.shape[0]):
plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('ESC Sequencing rep1')]
for i in arange(df1.shape[0]):
plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('K562 Sequencing rep1')]
for i in arange(df1.shape[0]):
plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
phn = pd.read_csv("../data/phn.csv",index_col=0)
df = pd.read_csv("../data/ASE_stat.csv", index_col=0, sep='\t')
df.columns = ['ASE']
df = df.loc[df.index.isin(phn.index)]
phn['ASE'] = 0
ds = phn['ASE']
ds.loc[df.index] = df['ASE'].tolist()
phn['ASE'] = ds
phn['Protocol2'] = phn['Protocol2'].str.replace("Seuqencing",'Sequencing')
phn.Protocol2.value_counts()
df = phn[(phn['Protocol2'].str.startswith('BGISEQ-500 K562')|phn['Protocol2'].str.startswith('Hiseq4000 K562'))]
df = df[df['Protocol2'].str.endswith('rep1')]
df.Protocol2.value_counts()
ax=sns.violinplot(x='Protocol2',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Protocol2',y='ASE',data=df, jitter=0.3)
df['Read_type'].value_counts()
ax=sns.violinplot(x='Read_type',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Read_type',y='ASE',data=df, jitter=0.3)
e6
phn = pd.read_csv("../data/phn.csv",index_col=0)
df = pd.read_table("../data/AS_stat.xls", index_col=0, sep='\t')
df.columns = ['ASE']
df = df.loc[df.index.isin(phn.index)]
phn['ASE'] = 0
ds = phn['ASE']
ds.loc[df.index] = df['ASE'].tolist()
phn['ASE'] = ds
ds.sum()
ls ../data/AS*
phn['Protocol2'] = phn['Protocol2'].str.replace("Seuqencing",'Sequencing')
phn.Protocol2.value_counts()
# df = phn[(phn['Protocol2'].str.startswith('BGISEQ-500 K562')|phn['Protocol2'].str.startswith('Hiseq4000 K562'))]
df = phn
df = df[df['Protocol2'].str.endswith('rep1')]
df.Protocol2.value_counts()
ax=sns.violinplot(x='Read_type',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Read_type',y='ASE',data=df, jitter=0.3)
df = phn[(phn['Protocol2'].str.startswith('BGISEQ-500 ESC')|phn['Protocol2'].str.startswith('Hiseq4000 ESC'))]
df = df[df['Protocol2'].str.endswith('rep1')]
df.Protocol2.value_counts()
ax=sns.violinplot(x='Protocol2',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Protocol2',y='ASE',data=df, jitter=0.3)
phn = pd.read_csv("../pdata/bgi_human_phn.csv", index_col=0)
phn = pd.read_csv("../pdata/bgi_mouse_phn.csv", index_col=0)
phn.head()
phn.Protocol2.value_counts()
phn = phn[phn.Protocol2.str.startswith('BGISEQ-500 K562')|phn.Protocol2.str.startswith('BGISEQ-500 ESC')|\
phn.Protocol2.str.startswith('Hiseq4000 K562')|phn.Protocol2.str.startswith('Hiseq4000 ESC')]
phn.shape
phn.Protocol2.value_counts()
ls ../pdata
ls ../pdata1
df = pd.read_csv("../pdata/bgi.csv", index_col=0)
df.head()
df.shape
df = df.loc[phn.index]
df.shape
df = df[~df.num_processed.isna()]
df
df1 = df.replace([np.inf, -np.inf], np.nan).dropna(subset=["detection_limit_SIRV"], how="all")
df1
ls ../supinfo/
from glob import iglob
for f in iglob('../supinfo/Tab*.csv'):
df = pd.read_csv(f,index_col=0)
fo = f.replace(".csv",'.1.csv')
df = df[df.columns[~df.columns.str.startswith('ENS')]]
df.to_csv(fo)